Conversation

@cyanguwa cyanguwa commented Sep 22, 2025

Description

This PR adds max_logit support to the FusedAttention and UnfusedDotProductAttention backends in TE-PyTorch. max_logit is the per-head maximum of mask(Q x K^T x scale + bias), and it is used by the MuonClip optimizer to rescale the Q and K projection weights.
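
To make the quantity concrete, here is a minimal PyTorch sketch (not TE's implementation; the function name, the BSHD layout, and the boolean mask convention are assumptions for illustration) of how a per-head max_logit can be computed alongside unfused attention:

```python
import torch

def attention_with_max_logit(q, k, v, bias=None, mask=None, scale=None):
    """Toy unfused attention that also returns the per-head max_logit.

    q, k, v: [batch, seq, heads, dim] (BSHD); mask is boolean, True = masked out.
    """
    d = q.shape[-1]
    scale = d ** -0.5 if scale is None else scale
    # logits = mask(Q x K^T x scale + bias), shape [batch, heads, s_q, s_kv]
    logits = torch.einsum("bqhd,bkhd->bhqk", q, k) * scale
    if bias is not None:
        logits = logits + bias
    if mask is not None:
        logits = logits.masked_fill(mask, float("-inf"))
    # Per-head maximum of the pre-softmax logits; MuonClip uses this to decide
    # how much to rescale the Q/K projection weights.
    max_logit = logits.amax(dim=(0, 2, 3))  # shape: [heads]
    out = torch.einsum("bhqk,bkhd->bqhd", torch.softmax(logits, dim=-1), v)
    return out, max_logit
```

In this sketch max_logit has shape [num_heads], matching the per-head maximum described above.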

This PR supports the FP16 and BF16 precisions and the BSHD and SBHD formats. It covers both non-CP and CP (cp_comm_type = {"p2p", "a2a", "a2a+p2p", "all_gather"}) cases.
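
Under context parallelism each rank only sees its own sequence chunk, so the local per-head maxima have to be combined across the CP group. Below is a rough sketch of that final reduction with torch.distributed; the function name and the cp_group handle are illustrative, not TE's internal code:

```python
import torch
import torch.distributed as dist

def reduce_max_logit_across_cp(local_max_logit: torch.Tensor, cp_group) -> torch.Tensor:
    """Combine per-rank, per-head maxima into the global per-head max_logit.

    local_max_logit: [heads], computed over this rank's sequence chunk.
    A MAX all-reduce over the context-parallel group yields the global value,
    whichever way the chunks were exchanged (p2p, a2a, a2a+p2p, all_gather).
    """
    dist.all_reduce(local_max_logit, op=dist.ReduceOp.MAX, group=cp_group)
    return local_max_logit
```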

It contains a breaking change: a return_max_logit argument is added to nvte_get_fused_attn_backend, nvte_fused_attn_fwd and nvte_fused_attn_bwd. In the future, TE will pack the tensor and non-tensor arguments of these APIs into structs in order to avoid breaking changes like this.

Support for the THD format is also implemented in this PR and will be enabled once cuDNN supports it.

This PR requires cuDNN 9.13.1 and cudnn-frontend 1.15.
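
Since the fused path depends on a recent cuDNN, a quick runtime check from PyTorch can help before relying on the feature. The numeric encoding below (major*10000 + minor*100 + patch for cuDNN 9.x) is an assumption about how the version integer is reported:

```python
import torch

# torch.backends.cudnn.version() returns an integer, e.g. 91301 for cuDNN 9.13.1
# under the assumed major*10000 + minor*100 + patch encoding.
version = torch.backends.cudnn.version()
if version is None or version < 91301:
    print(f"cuDNN {version} detected; max_logit with FusedAttention needs cuDNN >= 9.13.1")
```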

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Add max_logit support in FusedAttention and UnfusedDotProductAttention

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@cyanguwa cyanguwa added the 2.9.0 label Sep 22, 2025
cyanguwa and others added 17 commits September 30, 2025 11:22
@cyanguwa cyanguwa requested a review from ptrendx October 15, 2025 13:52
@cyanguwa (Collaborator Author)

/te-ci L1

@cyanguwa (Collaborator Author)

/te-ci L1

@BoxiangW (Contributor)

LGTM thx

@cyanguwa cyanguwa requested a review from skyw October 22, 2025 08:36
@cyanguwa (Collaborator Author)

/te-ci L1

@skyw skyw left a comment

@BoxiangW has reviewed and LGTM.

Approving.

@BoxiangW (Contributor) commented Oct 22, 2025

One more thing on this PR: I think we agreed earlier on changing the name to max_logit or max_qk_logit, since it represents the value before the softmax op.

@cyanguwa cyanguwa changed the title from "[PyTorch] Add max_score support for MuonClip" to "[PyTorch] Add max_logit support for MuonClip" Oct 23, 2025
@cyanguwa cyanguwa requested a review from mk-61 October 23, 2025 15:17
@cyanguwa (Collaborator Author)

/te-ci L1

@vcherepanov-nv (Collaborator)

LGTM

@cyanguwa (Collaborator Author)

/te-ci L1

@cyanguwa cyanguwa merged commit 87cb26c into NVIDIA:main Oct 25, 2025
47 of 53 checks passed
KshitijLakhani pushed a commit that referenced this pull request Oct 28, 2025
* add max_score for fused/unfused F16 non-CP
* calculate max per head instead of max over all heads
* fix fused attn max_score shape
* revert FE to github
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* update FE to 1.15.0-rc
* fix merge
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* reduce ew kernels; fix causal masks; add more tests
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* minor fix to tests
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* remove logic for flash-attn
* WIP: add CP support for p2p/a2a/all_gather
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* minor improvements of implementation/tests
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* WIP: add thd support
* add thd to UnfusedDPA
* fix lint
* more fixes for lint
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* update to FE 1.15
* remove unneeded changes
* disable unfused for thd + pad_between_seqs
* minor fixes
* disable thd for unfused until bug is fixed
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* fix all_gather
* fix all gather
* rename max_score to max_logit
* fix all_gather
* fix all_gather
* disable fused attn + thd

---------

Signed-off-by: Charlene Yang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>